
 entropy number


Reliable Estimation of KL Divergence using a Discriminator in Reproducing Kernel Hilbert Space: Supplementary Material

Neural Information Processing Systems

Organization: This supplementary material follows the structure of the main paper, and its section numbers and titles match those of the main paper. We add one new section, Section 10, which describes the societal impact and possible negative impacts of the paper. Similarly, the theorem numbers are consistent with the main paper, but we also include several additional theorems and lemmas that were not in the main paper. GAN-type Objective for KL Estimation: Let $f$ be a discriminator, $f: X \to \mathbb{R}$. Let $p(x)$ and $q(x)$ be two probability density functions defined over the space $X$.
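The excerpt above is cut off before the objective itself is stated, so the following is only a hedged illustration of the density-ratio identity that discriminator-based KL estimators commonly rely on: if a discriminator recovers $f^*(x) = \log\bigl(p(x)/q(x)\bigr)$, then $\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}[f^*(x)]$. The sketch checks this by Monte Carlo for two unit-variance Gaussians. The function names, the choice of distributions, and the use of the exact log-ratio in place of a trained discriminator are all assumptions made for illustration and are not taken from the paper.

```python
import random


def log_ratio(x, mu_p=0.0, mu_q=1.0):
    """Exact log p(x)/q(x) for p = N(mu_p, 1) and q = N(mu_q, 1).

    Stands in for an ideal discriminator output (an assumption of this sketch).
    """
    return 0.5 * ((x - mu_q) ** 2 - (x - mu_p) ** 2)


def mc_kl(n=200_000, mu_p=0.0, mu_q=1.0, seed=0):
    """Monte Carlo estimate of KL(p || q) as the mean of log p/q under p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(mu_p, 1.0)  # draw a sample from p
        total += log_ratio(x, mu_p, mu_q)
    return total / n


def closed_form_kl(mu_p=0.0, mu_q=1.0):
    """KL divergence between two unit-variance Gaussians."""
    return 0.5 * (mu_p - mu_q) ** 2


if __name__ == "__main__":
    print("Monte Carlo:", mc_kl())
    print("Closed form:", closed_form_kl())
```

In practice the log-ratio is unknown and has to be learned from samples, which is where the choice of function class for the discriminator (here, an RKHS according to the title) enters.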


Reviews: Fast learning rates with heavy-tailed losses

Neural Information Processing Systems

This paper provides some new results in an important area which is receiving more and more attention: fast rates when loss functions are unbounded and heavy-tailed. Existing results based on empirical process theory often rely on bounded or sub-Gaussian losses, and the heavy-tailed (hence non-sub-Gaussian) case is considerably harder. The results presented seem sound and are definitely novel. They rely on results of Sara van de Geer and collaborators on concentration inequalities for unbounded empirical processes. The material is very technical, and I would suggest moving even more of it to the appendix.


Neural networks: deep, shallow, or in between?

arXiv.org Machine Learning

The fascinating new developments in the area of Artificial Intelligence (AI) and other important applications of neural networks prompt the need for a theoretical mathematical study of their potential to reliably approximate complicated objects. Various network architectures have been used in different applications with substantial success rates without significant theoretical backing of the choices made. Thus, a natural question to ask is whether and how the architecture chosen affects the approximation power of the outputs of the resulting neural network. In this paper, we attempt to clarify how the width and the depth of a feed-forward neural network affect its worst performance. More precisely, we provide estimates from below for the error of approximation of a compact subset $K \subset X$ of a Banach space $X$ by the outputs of feed-forward neural networks (NNs) with width $W$, depth $\ell$, bound $w(W,\ell)$ on their parameters, and Lipschitz activation functions. Note that the ReLU function is included in our investigation since it is a Lipschitz function with a Lipschitz constant $L = 1$. To prove our results, we assume that we know lower bounds on the entropy numbers of the compact sets $K$ that we approximate by the outputs of feed-forward NNs.
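For readers unfamiliar with the term, one common definition of the entropy numbers of a compact set $K$ in a Banach space $X$ is sketched below. This is the standard textbook convention and is stated here as an assumption; the abstract does not spell out which convention the paper uses, and some authors cover with $2^{n-1}$ balls instead of $2^n$.

```latex
\varepsilon_n(K)_X := \inf\Bigl\{ \varepsilon > 0 \;:\;
    K \subset \bigcup_{j=1}^{2^n} B(x_j, \varepsilon)
    \ \text{for some } x_1, \dots, x_{2^n} \in X \Bigr\},
\qquad n \ge 0,
\quad\text{where } B(x, \varepsilon) := \{\, y \in X : \|y - x\|_X \le \varepsilon \,\}.

% Worked example: K = [0,1] in X = \mathbb{R}.
% 2^n intervals of length 2\varepsilon cover [0,1] exactly when 2^n \cdot 2\varepsilon \ge 1, so
\varepsilon_n\bigl([0,1]\bigr)_{\mathbb{R}} = 2^{-(n+1)} .
```

The faster $\varepsilon_n(K)_X$ decays in $n$, the "smaller" the set $K$ is, which is why lower bounds on entropy numbers can translate into lower bounds on approximation error.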


Limitations on approximation by deep and shallow neural networks

arXiv.org Artificial Intelligence

Since neural network approximation is the method of choice in building numerical algorithms in many application areas, it is important to understand not only how well they approximate but also any lower bounds on their approximation power. In this paper, we study the limitations of deep and shallow neural networks to approximate a compact subset $K \subset X$ of a Banach space $X$ when it is required that the parameters in the approximation procedure have certain bounds. This is done by proving appropriate Carl's type inequalities that relate the error of neural network approximation of $K$ to the entropy numbers of this set. We consider feed-forward neural networks (NN) with ReLU or Lipschitz sigmoidal activation functions, width $W \geq 2$ and depth $n$, whose parameters have absolute values bounded by a given function $w(n)$. We prove that the capabilities of these networks to approximate any compact subset $K$ are limited by the behavior of its entropy numbers.


$L^p$ sampling numbers for the Fourier-analytic Barron space

arXiv.org Artificial Intelligence

In this paper, we consider Barron functions $f : [0,1]^d \to \mathbb{R}$ of smoothness $\sigma > 0$, which are functions that can be written as \[ f(x) = \int_{\mathbb{R}^d} F(\xi) \, e^{2 \pi i \langle x, \xi \rangle} \, d \xi \quad \text{with} \quad \int_{\mathbb{R}^d} |F(\xi)| \cdot (1 + |\xi|)^{\sigma} \, d \xi < \infty. \] For $\sigma = 1$, these functions play a prominent role in machine learning, since they can be efficiently approximated by (shallow) neural networks without suffering from the curse of dimensionality. For these functions, we study the following question: Given $m$ point samples $f(x_1),\dots,f(x_m)$ of an unknown Barron function $f : [0,1]^d \to \mathbb{R}$ of smoothness $\sigma$, how well can $f$ be recovered from these samples, for an optimal choice of the sampling points and the reconstruction procedure? Denoting the optimal reconstruction error measured in $L^p$ by $s_m (\sigma; L^p)$, we show that \[ m^{- \frac{1}{\max \{ p,2 \}} - \frac{\sigma}{d}} \lesssim s_m(\sigma;L^p) \lesssim (\ln (e + m))^{\alpha(\sigma,d) / p} \cdot m^{- \frac{1}{\max \{ p,2 \}} - \frac{\sigma}{d}} , \] where the implied constants only depend on $\sigma$ and $d$ and where $\alpha(\sigma,d)$ stays bounded as $d \to \infty$.
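To make the dimension dependence of the stated rate concrete, the following minimal sketch evaluates the polynomial exponent $\frac{1}{\max\{p,2\}} + \frac{\sigma}{d}$ from the two-sided bound above. The function name is hypothetical, and the sketch ignores the logarithmic factor and the implied constants, which the abstract does not specify numerically.

```python
def barron_rate_exponent(p, sigma, d):
    """Exponent a such that s_m(sigma; L^p) behaves like m**(-a), up to log factors.

    Implements a = 1 / max(p, 2) + sigma / d from the bound quoted above.
    p may be float('inf') for the L^infinity case.
    """
    if p < 1 or sigma <= 0 or d < 1:
        raise ValueError("expected p >= 1, sigma > 0 and d >= 1")
    return 1.0 / max(p, 2.0) + sigma / d


if __name__ == "__main__":
    # As d grows, the exponent approaches 1 / max(p, 2) from above.
    for d in (2, 10, 100, 1000):
        print(d, barron_rate_exponent(p=2, sigma=1.0, d=d))
```

The smoothness term $\sigma/d$ vanishes as $d \to \infty$, but the exponent never falls below $1/\max\{p,2\}$.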


Optimal learning of high-dimensional classification problems using deep neural networks

arXiv.org Machine Learning

We study the problem of learning classification functions from noiseless training samples, under the assumption that the decision boundary is of a certain regularity. We establish universal lower bounds for this estimation problem, for general classes of continuous decision boundaries. For the class of locally Barron-regular decision boundaries, we find that the optimal estimation rates are essentially independent of the underlying dimension and can be realized by empirical risk minimization methods over a suitable class of deep neural networks. These results are based on novel estimates of the $L^1$ and $L^\infty$ entropies of the class of Barron-regular functions.


Optimal Approximation Rates and Metric Entropy of ReLU$^k$ and Cosine Networks

arXiv.org Machine Learning

This article addresses several fundamental issues associated with the approximation theory of neural networks, including the characterization of approximation spaces, the determination of the metric entropy of these spaces, and approximation rates of neural networks. For any activation function $\sigma$, we show that the largest Banach space of functions which can be efficiently approximated by the corresponding shallow neural networks is the space whose norm is given by the gauge of the closed convex hull of the set $\{\pm\sigma(\omega\cdot x + b)\}$. We characterize this space for the ReLU$^k$ and cosine activation functions and, in particular, show that the resulting gauge space is equivalent to the spectral Barron space if $\sigma=\cos$ and is equivalent to the Barron space when $\sigma={\rm ReLU}$. Our main result establishes the precise asymptotics of the $L^2$-metric entropy of the unit ball of these gauge spaces and, as a consequence, the optimal approximation rates for shallow ReLU$^k$ networks. The sharpest previous results hold only in the special case that $k=0$ and $d=2$, where the metric entropy has been determined up to logarithmic factors. When $k > 0$ or $d > 2$, there is a significant gap between the previous best upper and lower bounds. We close all of these gaps and determine the precise asymptotics of the metric entropy for all $k \geq 0$ and $d\geq 2$, including removing the logarithmic factors previously mentioned. Finally, we use these results to quantify how much is lost by Barron's spectral condition relative to the convex hull of $\{\pm\sigma(\omega\cdot x + b)\}$ when $\sigma={\rm ReLU}^k$.


Learning Rates for Kernel-Based Expectile Regression

arXiv.org Machine Learning

Conditional expectiles are becoming an increasingly important tool in finance as well as in other areas of applications. We analyse a support vector machine type approach for estimating conditional expectiles and establish learning rates that are minimax optimal modulo a logarithmic factor if Gaussian RBF kernels are used and the desired expectile is smooth in a Besov sense. As a special case, our learning rates improve the best known rates for kernel-based least squares regression in this scenario. Key ingredients of our statistical analysis are a general calibration inequality for the asymmetric least squares loss, a corresponding variance bound as well as an improved entropy number bound for Gaussian RBF kernels.
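The asymmetric least squares loss mentioned above can be illustrated with a minimal sketch. It assumes the usual convention $L_\tau(y,t) = \tau (y-t)^2$ for $y \ge t$ and $(1-\tau)(y-t)^2$ otherwise, and it computes an unconditional sample expectile by iteratively reweighted averaging. The function names are hypothetical, and this is not the kernel-based estimator analysed in the paper.

```python
def als_loss(y, t, tau):
    """Asymmetric least squares loss for target y, prediction t, level tau in (0, 1)."""
    r = y - t
    weight = tau if r >= 0 else 1.0 - tau
    return weight * r * r


def sample_expectile(ys, tau, max_iter=100, tol=1e-12):
    """Minimiser of the average ALS loss over a constant prediction t.

    The first-order condition says t is a weighted mean of the data, with
    weight tau on points at or above t and 1 - tau on points below it;
    the loop iterates that fixed-point equation.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    t = sum(ys) / len(ys)  # start from the ordinary mean
    for _ in range(max_iter):
        weights = [tau if y >= t else 1.0 - tau for y in ys]
        t_new = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t


if __name__ == "__main__":
    data = [0.0, 1.0, 2.0, 3.0, 10.0]
    print(sample_expectile(data, 0.5))  # coincides with the sample mean
    print(sample_expectile(data, 0.9))  # pulled toward the upper tail
```

For $\tau = 0.5$ the loss is half the squared error, so the expectile reduces to the mean, which is consistent with the abstract treating least squares regression as a special case.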


The Entropy Regularization Information Criterion

Neural Information Processing Systems

Effective methods of capacity control via uniform convergence bounds for function expansions have been largely limited to Support Vector Machines, where good bounds are obtainable by the entropy number approach. We extend these methods to systems with expansions in terms of arbitrary (parametrized) basis functions and a wide range of regularization methods covering the whole range of general linear additive models. This is achieved by a data-dependent analysis of the eigenvalues of the corresponding design matrix.

